Problem 2

Problem 2 - Conclusion

Of the 6 models, the best at encoding/decoding images is the convolutional autoencoder with a 128-dimensional latent space, because it is able to reconstruct even the details of the clothes. I would say that any of the convolutional models beats any of the fully connected models, because their images are sharper and less blurry.

Convolutional vs. Fully-connected

Overall, it can be seen that convolutional neural networks do a better job of encoding/decoding the images than the fully connected ones, though the difference is not huge. Both types of architecture are able to encode and recover all the images; however, because of the nature of convolutional networks, they capture the shapes in a better way. If we compare the architectures for the 16-dimensional latent space, the fully connected autoencoder retrieves somewhat blurry images, while the convolutional autoencoder retrieves much sharper ones. Likewise, for the 128-dimensional latent space models, the fully connected autoencoder retrieves less blurry images, but the convolutional autoencoder is able to recover even some of the details in the clothes, like the prints on t-shirts or the lines on pullovers.

Code dimension

It can be seen that the 16-dimensional latent space models do not do a bad job encoding/decoding the images, because they are able to reconstruct at least the same type of clothing. However, having a higher-dimensional latent space, like 128 dimensions, definitely helps to retrieve better-quality images that are closer to the originals. The general trend is that the more dimensions the latent space has, the more information from the original images can be preserved, allowing a better reconstruction. Nonetheless, being able to encode 784 dimensions, which is the size of the original images, in only 16 dimensions and then recover them reasonably well is very interesting, since 16/784 is only about 2% of the original input.
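The compression factors mentioned above can be computed directly from the dimensions used in this problem:

```python
# Compression ratio of each latent space relative to the flattened 28x28 input.
input_dim = 28 * 28  # 784 pixels per Fashion-MNIST image
for latent_dim in (16, 32, 128):
    ratio = latent_dim / input_dim
    print(f"{latent_dim:>3} dims -> {ratio:.1%} of the original input")
```

This confirms that the 16-dimensional code keeps roughly 2% of the input's raw dimensionality, and even the 128-dimensional code keeps only about 16%.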

Problem 3

Problem 3 - Conclusion

Qualitatively compared to the autoencoders, I can see the following differences:

  1. The backgrounds of all images reconstructed using PCA are blurry.
  2. For the first 16 and 32 components of PCA, the reconstructed images are worse than the ones produced by the autoencoders with 16- and 32-dimensional latent spaces. In some PCA reconstructions, the edges of other clothes appear in the background, e.g. behind a sneaker you can see the shape of a pullover as a shadow.
  3. Aside from the blurry background, the images reconstructed by PCA using the first 128 components look good. They are comparable to the images reconstructed by the convolutional autoencoder, since in both cases the images are sharp and recover a great part of the details.

Both PCA and autoencoders are able to map a high-dimensional manifold (784 dimensions), where the original images lie, into a lower-dimensional one (either 16, 32, or 128 dimensions) and then map/reconstruct it back to the original high-dimensional manifold. However, since PCA applies a linear mapping, it learns a linear manifold, and some of the information about the images is not completely encoded when projecting them to the lower-dimensional manifold. In contrast, autoencoders can have non-linear activation functions (sigmoid, ELU, etc.) in their neurons, meaning that the mapping is done using non-linear functions, and thus they learn a non-linear manifold. Consequently, autoencoders can capture more information about the original images when projecting the data to a lower-dimensional manifold. In other words, using non-linear functions allows us to encode/decode images from a high-dimensional manifold to a lower-dimensional one in a better way, and that explains why the images reconstructed with the autoencoders are of higher quality than the ones reconstructed by PCA.
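The linear encode/decode of PCA can be written in a few lines. A minimal numpy sketch, using random data as a stand-in for the flattened images:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))           # stand-in for flattened 28x28 images
X_centered = X - X.mean(axis=0)

# Principal components = right singular vectors of the centered data matrix.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 16                                    # latent dimension, as in Problem 3
W = Vt[:k]                                # (k, 784) linear "encoder"

Z = X_centered @ W.T                      # encode: linear projection to k dims
X_rec = Z @ W + X.mean(axis=0)            # decode: linear map back to 784 dims

# The reconstruction is lossy: only the variance along the top-k directions
# survives; everything orthogonal to them is discarded by the linear projection.
err = np.mean((X - X_rec) ** 2)
print(err)
```

Because both encode and decode are single matrix multiplications, the composition is linear, which is exactly why PCA can only learn a linear (flat) manifold.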

Problem 4

Images visually different

Conclusion images visually different

For images that are visually different (a sneaker and a pullover), the sequence of images along the linear path does not move from one class to the other in a smooth way; the images between the two originals are just a superposition of both objects. In contrast, for both autoencoder types, although they are not perfect, it can be seen that the images between the two extremes of the sequence mutate more smoothly. Starting from the sneaker as the principal object in the image, the pullover appears little by little, until it becomes the main object with the sneaker hidden in its center.

Images visually similar

Conclusion images visually similar

For images that are visually similar (two images of pants), the sequence of images along the linear path changes in a very smooth way: the curvature of one of the legs of the pants decreases with each image in the sequence until it becomes the second pair of pants with both legs straight.

The behaviour of the sequences interpolated in the different latent space sizes is analogous to the one along the linear path. The only difference I can tell is between the fully connected autoencoders and the convolutional autoencoders. The curvature of the first pair of pants seems to be captured correctly by the convolutional models, but not very well by the fully connected models. I am not saying that the curvature is not captured at all, but it is not as sharp and pronounced as in the original image or in the reconstruction made by the convolutional autoencoder. Therefore, the changes in the sequence of images from pants 1 to pants 2 are more difficult to see for the fully connected models.
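The two interpolation schemes compared above differ only in where the convex combination is taken. A sketch with a toy nonlinear decoder (the random tanh decoder here is an assumption for illustration, not one of the trained models):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(784, 16))            # hypothetical decoder weights
decode = lambda z: np.tanh(z @ D.T)       # toy nonlinear decoder: R^16 -> R^784

z1, z2 = rng.normal(size=16), rng.normal(size=16)
x1, x2 = decode(z1), decode(z2)

ts = np.linspace(0, 1, 9)
# Pixel-space path: a cross-fade, so intermediate frames are superpositions
# of the two images.
pixel_path = [(1 - t) * x1 + t * x2 for t in ts]
# Latent-space path: decode the interpolated codes, so every frame is a
# point produced by the decoder rather than a blend of two images.
latent_path = [decode((1 - t) * z1 + t * z2) for t in ts]

# Because the decoder is nonlinear, the two paths disagree in between.
print(np.abs(pixel_path[4] - latent_path[4]).max())
```

The endpoints coincide by construction; only the intermediate frames differ, which matches the observation that the latent path "morphs" while the pixel path "fades".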

Problem 5

NOTE: To describe the output of each layer, I used a library called "summary" that reports the output shape of each layer; that way, you can calculate how many neurons a convolutional layer has. This library is installed by executing the following cell.
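The layer sizes reported below can also be checked by hand with the standard convolution output formula. The sketch assumes 5x5 kernels with stride 1 and no padding, 32 output channels, and 2x2 max-pooling; these hyperparameters are inferred from the reported sizes (18432 = 32·24·24, 4608 = 32·12·12), not confirmed from the model code:

```python
# Output side length of a square convolution/pooling layer:
# out = floor((in + 2*padding - kernel) / stride) + 1
def out_size(n, kernel, stride=1, padding=0):
    return (n + 2 * padding - kernel) // stride + 1

# Assumed architecture: 28x28 input, 5x5 conv (32 channels), then 2x2 max-pool.
n = out_size(28, kernel=5)            # 24
conv_units = 32 * n * n               # 32 * 24 * 24 = 18432, as in Problem 5
n = out_size(n, kernel=2, stride=2)   # 12
pool_units = 32 * n * n               # 32 * 12 * 12 = 4608
print(conv_units, pool_units)
```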

Fully-connected

fcAE16

a)

$\mathbb{R}^{784} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{16} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{784}$

b)

Yes, it can be determined from the weights that the encoder is a submersion, because:

c)

Yes, it can be determined from the weights that the decoder is an immersion, because:

fcAE32

a)

$\mathbb{R}^{784} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{32} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{784}$

b)

Yes, it can be determined from the weights that the encoder is a submersion, because:

c)

Yes, it can be determined from the weights that the decoder is an immersion, because:

fcAE128

a)

$\mathbb{R}^{784} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{128} \to \mathbb{R}^{256} \to \mathbb{R}^{256} \to \mathbb{R}^{784}$

b)

Yes, it can be determined from the weights that the encoder is a submersion, because:

c)

Yes, it can be determined from the weights that the decoder is an immersion, because:

Convolutional

convAE16

a)

$\mathbb{R}^{784} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{16} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{784}$

b)

No, it cannot be determined from the weights that the encoder is a submersion, because:

c)

No, it cannot be determined from the weights that the decoder is an immersion, because:

convAE32

a)

$\mathbb{R}^{784} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{32} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{784}$

b)

No, it cannot be determined from the weights that the encoder is a submersion, because:

c)

No, it cannot be determined from the weights that the decoder is an immersion, because:

convAE128

a)

$\mathbb{R}^{784} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{128} \to \mathbb{R}^{4608} \to \mathbb{R}^{4608} \to \mathbb{R}^{18432} \to \mathbb{R}^{18432} \to \mathbb{R}^{784}$

b)

No, it cannot be determined from the weights that the encoder is a submersion, because:

c)

No, it cannot be determined from the weights that the decoder is an immersion, because:

Problem 6

Problem 6 - Conclusion

It is very clear that for conv16, conv32, conv128, fc16, and fc32 the minimum singular value is greater than 0, meaning that the Jacobian is full rank and consequently showing that the decoder is an immersion. However, for the fc128 autoencoder, the minimum singular value is $2.28 \times 10^{-6}$, a number very close to 0. Therefore, depending on the numerical precision we use, this can mean that the Jacobian is not full rank and thus that the decoder is not an immersion.

If we take a closer look at the minimum and maximum singular values and the rank of each weight matrix of the fcAE128 decoder, we can see that all layers are full rank and the minimum singular values are not 0, indicating that the fcAE128 decoder should be an immersion, as stated in Problem 5. However, the minimum singular value of the first layer is 0.00028, which is very close to 0. When we compose the layers, we can expect that number to go down even further, and that could explain why the Jacobian matrix at our base image is not full rank for this autoencoder version.
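The per-layer check described above (rank plus min/max singular values) can be done with numpy. A sketch using random matrices as stand-ins for the fcAE128 decoder weights, with the shapes taken from the architecture in Problem 5:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for (some of) the fcAE128 decoder weight matrices:
# 128 -> 128, 128 -> 256, 256 -> 784.
layers = [rng.normal(size=(128, 128)),
          rng.normal(size=(256, 128)),
          rng.normal(size=(784, 256))]

for W in layers:
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    rank = np.linalg.matrix_rank(W)
    # Full column rank with a positive min singular value means the linear
    # map is injective; a tiny min singular value means it is nearly rank
    # deficient, which is what the fc128 result above illustrates.
    print(W.shape, rank, s.min(), s.max())
```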

Problem 7

Problem 7 - Conclusion

Looking at the tangent vectors of the different autoencoder versions, the overall trend is that with a bigger latent space we can encode/retain more information and more characteristics of the images. The autoencoders with a latent space of size 16 are able to encode the shape of the pullover, in this case by retaining the information of its edges. The autoencoders with a latent space of size 32 start to encode some of the details of the pullover, like lines or any other design that can appear on it. Finally, the autoencoders with a latent space of size 128 can encode many more details of the base image. If we look at the 128 tangent vectors, we can see that they point out different parts of the pullover that can change, though it is difficult to tell which particular feature each tangent vector is capturing.

On the other hand, the tangent vectors corresponding to the first 128 components of PCA seem to contain the shadows of multiple classes of clothes, rather than one class at a time, as the autoencoders do with a base image from a specific class. This makes sense, since the tangent vectors of PCA are the eigenvectors, which are not generated from a base image but from the entire set of training images. Consequently, these tangent vectors capture the variance across all the classes in the dataset.
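This contrast can be made concrete: for a nonlinear decoder, the tangent vectors are the columns of the Jacobian at the base code, so they change with the base image; for PCA, the map is linear, so its Jacobian is the same fixed matrix of principal directions everywhere. A sketch with a toy tanh decoder (an assumption, not one of the trained models) and a finite-difference Jacobian:

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(784, 16))
decode = lambda z: np.tanh(z @ D.T)       # toy nonlinear decoder (assumption)

def jacobian(f, z, eps=1e-6):
    """Finite-difference Jacobian; its columns are the tangent vectors at f(z)."""
    base = f(z)
    cols = [(f(z + eps * e) - base) / eps for e in np.eye(len(z))]
    return np.stack(cols, axis=1)         # shape (784, 16)

z_a, z_b = rng.normal(size=16), rng.normal(size=16)
J_a, J_b = jacobian(decode, z_a), jacobian(decode, z_b)

# The decoder's tangent vectors depend on the base code...
print(np.abs(J_a - J_b).max())
# ...whereas PCA's tangent vectors are the fixed principal directions,
# independent of any base point, because a linear map has a constant Jacobian.
```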

Problem 8

Original Pullover transformation plots

Overall, for the autoencoders we can see that:

On the other hand, for PCA we can see that:

Next, I am going to show the case of a sneaker, where the convolutional model performs better for the X-translation and rotation transformations.

Original Sneaker transformation plots

For this case, the convAE128 is more invariant to the X-translation and rotation transformations than PCA with 128 components. Again, if we take a look at the plot of the first 128 tangent vectors of PCA, we can see that the shape of a sneaker is present in some of them, but not all. As I mentioned before, the dominant shape in PCA's tangent vectors is that of a pullover/shirt/t-shirt, and this would explain why, for PCA, the sneaker is not as invariant to these transformations as the pullover was.

A possible explanation of why PCA still does a great job on the Y-translation transformation for the sneaker could be the following. If we analyze the principal component images, all sneaker shapes seem to have a thick sole, and that may be making PCA robust, or invariant, to Y-shifts for this object.
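One way to probe such a claim numerically is to compare how much a model's reconstruction changes under a Y-shift, relative to how much the input itself changed. A toy sketch: the "sneaker" is a hand-drawn blob with a thick sole, and a random orthonormal projection stands in for a trained 128-component PCA (both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "image": a sneaker-like blob with a thick horizontal sole near the bottom.
img = np.zeros((28, 28))
img[20:24, 4:24] = 1.0                     # thick sole
img[16:20, 12:24] = 0.6                    # upper part of the shoe

x = img.ravel()
shifted = np.roll(img, 1, axis=0).ravel()  # Y translation: shift down 1 pixel

# Stand-in linear encoder/decoder with 128 orthonormal directions (random
# here; a real PCA would use the top-128 principal components instead).
W, _ = np.linalg.qr(rng.normal(size=(784, 128)))
reconstruct = lambda v: W @ (W.T @ v)

# Invariance proxy: change in the reconstruction divided by the change in
# the input. Values near 0 mean the model mostly ignores the shift.
num = np.linalg.norm(reconstruct(x) - reconstruct(shifted))
den = np.linalg.norm(x - shifted)
print(num / den)
```

For an orthogonal projection this ratio is at most 1; a real PCA basis whose components all share the thick-sole pattern would push it toward 0, which is the effect hypothesized above.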